
arXiv.org e-Print Archive

Detectability of Varied Hybridization Scenarios using Genome-Scale Hybrid Detection Methods

Author: Bjorner Marianne
Dewey Colin N.
Molloy Erin K.
Solis-Lemus Claudia
Publication date: 01/11/2022

Hybridization events complicate the accurate reconstruction of phylogenies, as they lead to patterns of genetic heritability that are unexpected under traditional, bifurcating models of species trees. This has led to the development of methods to infer these varied hybridization events, both methods that reconstruct networks directly, and summary methods that predict individual hybridization events. However, a lack of empirical comparisons between methods - especially pertaining to large networks with varied hybridization scenarios - hinders their practical use. Here, we provide a comprehensive review of popular summary methods: TICR, MSCquartets, HyDe, Patterson's D-Statistic (ABBA-BABA), D3, and Dp. TICR and MSCquartets are based on quartet concordance factors gathered from gene tree topologies and Patterson's D-Statistic, D3, and Dp use site pattern frequencies to identify hybridization events. We then use simulated data to address questions of method accuracy and ideal use scenarios by testing methods against complex networks which depict gene flow events that differ in depth (timing), quantity (single vs. multiple, overlapping hybridizations), and rate of gene flow. We find that deeper or multiple hybridization events may introduce noise and weaken the signal of hybridization, leading to higher false negative rates across methods. Despite some forms of hybridization eluding quartet-based detection methods, MSCquartets displays high precision in most scenarios. While HyDe results in high false negative rates when tested on hybridizations involving ghost lineages, HyDe is the only method able to separate hybrid vs. parent signals. Lastly, we test the methods on ultraconserved elements from the bee subfamily Nomiinae, finding the possibility of hybridization events between clades which correspond to regions of poor support in the species tree estimated in the original study.
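To make the site-pattern idea concrete, here is a minimal sketch of Patterson's D-statistic for four aligned sequences. It is an illustration of the standard ABBA-BABA definition, not the implementation evaluated in the paper. The function name `patterson_d` and the taxon labels `p1`, `p2`, `p3`, `outgroup` are assumptions made for this example, and significance testing (commonly a block jackknife) is left out.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns and return (abba, baba, D).

    Assumes the rooted topology (((P1, P2), P3), O) and equal-length,
    pre-aligned sequences. The outgroup base is taken as ancestral ("A").
    """
    if not (len(p1) == len(p2) == len(p3) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")

    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Skip gaps and ambiguous bases.
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele

    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return abba, baba, d


# Toy alignment: two ABBA sites, one BABA site, one invariant site.
abba, baba, d = patterson_d("AAGA", "GGAA", "GGGA", "AAAA")
print(abba, baba, round(d, 3))
```

A value of D near zero is what incomplete lineage sorting alone would be expected to produce; a clear excess of one pattern over the other suggests gene flow between P3 and one of the sister taxa.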

A Genome-Wide Map of Conserved MicroRNA Targets in C. elegans

Author: Bray Nicolas
Chen Kevin
Colombo Teresa
Dewey Colin N.
Grün Dominic
Gunsalus Kristin C.
Kao Huey-Ling
Krek Azra
Lall Sabbi
MacMenamin Philip
Pachter Lior
Piano Fabio
Rajewsky Nikolaus
Sood Pranidhi
Wang Yi-Lu
Publication venue: Elsevier Ltd.
Publication date: 01/01/2006

Summary. Background: Metazoan miRNAs regulate protein-coding genes by binding the 3′ UTR of cognate mRNAs. Identifying targets for the 115 known C. elegans miRNAs is essential for understanding their function. Results: By using a new version of PicTar and sequence alignments of three nematodes, we predict that miRNAs regulate at least 10% of C. elegans genes through conserved interactions. We have developed a new experimental pipeline to assay 3′ UTR-mediated posttranscriptional gene regulation via an endogenous reporter expression system amenable to high-throughput cloning, demonstrating the utility of this system using one of the most intensely studied miRNAs, let-7. Our expression analyses uncover several new potential let-7 targets and suggest a new let-7 activity in head muscle and neurons. To explore genome-wide trends in miRNA function, we analyzed functional categories of predicted target genes, finding that one-third of C. elegans miRNAs target gene sets are enriched for specific functional annotations. We have also integrated miRNA target predictions with other functional genomic data from C. elegans. Conclusions: At least 10% of C. elegans genes are predicted miRNA targets, and a number of nematode miRNAs seem to regulate biological processes by targeting functionally related genes. We have also developed and successfully utilized an in vivo system for testing miRNA target predictions in likely endogenous expression domains. The thousands of genome-wide miRNA target predictions for nematodes, humans, and flies are available from the PicTar website and are linked to an accessible graphical network-browsing tool allowing exploration of miRNA target predictions in the context of various functional genomic data resources.

Elsevier - Publisher Connector

Caltech Authors

MDC Repository

MPG.PuRe

RNA-Seq gene expression estimation with read mapping uncertainty

Author: Beissbarth
Bo Li
Cloonan
Colin N. Dewey
Dempster
Dohm
Faulkner
Hsu
James A. Thomson
Jiang
Kapur
Lacroix
Langmead
Lister
Marioni
Morin
Mortazavi
Nagalakshmi
Ron M. Stewart
Staden
Victor Ruotti
Wang
Publication venue: Oxford University Press

Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically

Public Library of Science (PLOS)

Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans

Author: Alisha K Holloway
Andrew D Kern
Charles H Langley
Colin N Dewey
Corbin D Jones
David J Begun
Eugene Myers
Kristian Stevens
LaDeana W Hillier
Lior Pachter
Matthew W Hahn
Mohamed A. F Noor
Phillip M Nista
Yu-Ping Poh
Publication venue: Public Library of Science
Publication date: 01/01/2007

The population genetic perspective is that the processes shaping genomic variation can be revealed only through simultaneous investigation of sequence polymorphism and divergence within and between closely related species. Here we present a population genetic analysis of Drosophila simulans based on whole-genome shotgun sequencing of multiple inbred lines and comparison of the resulting data to genome assemblies of the closely related species, D. melanogaster and D. yakuba. We discovered previously unknown, large-scale fluctuations of polymorphism and divergence along chromosome arms, and significantly less polymorphism and faster divergence on the X chromosome. We generated a comprehensive list of functional elements in the D. simulans genome influenced by adaptive evolution. Finally, we characterized genomic patterns of base composition for coding and noncoding sequence. These results suggest several new hypotheses regarding the genetic and biological mechanisms controlling polymorphism and divergence across the Drosophila genome, and provide a rich resource for the investigation of adaptive evolution and functional variation in D. simulans

eScholarship - University of California

Digital Commons@Becker

Carolina Digital Repository

Caltech Authors

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Author: A Mortazavi
A Roberts
B Langmead
B Li
B Paşaniuc
Bo Li
C Trapnell
Colin N Dewey
ET Wang
F De Bona
G Robertson
GJ Faulkner
H Jiang
H Li
H Richard
J Feng
J Li
JC Marioni
JH Bullard
JS Liu
KD Hansen
KF Au
L Shi
M Guttman
M Nicolae
M Taub
MD Robinson
MG Grabherr
R Morin
S Anders
SA Bustin
U Nagalakshmi
WJ Kent
X Wang
Y Katz
Z Wang
Z Wu
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract. Background: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results: We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions: RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
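The abstract's central difficulty, reads that map to several transcripts, can be illustrated with a toy expectation-maximization loop that shares each ambiguous read among its candidate transcripts in proportion to the current abundance estimates. This is a deliberately simplified sketch and not RSEM's actual model: it ignores transcript length, read quality, and fragment information. The function name `em_abundance` and the input layout (one list of candidate transcript indices per read) are assumptions made for this example.

```python
def em_abundance(read_alignments, n_transcripts, n_iter=100):
    """Estimate the fraction of reads originating from each transcript.

    read_alignments: one list of candidate transcript indices per read
                     (hypothetical input format for this sketch).
    Returns a list of length n_transcripts that sums to 1.
    """
    # Start from a uniform guess.
    theta = [1.0 / n_transcripts] * n_transcripts

    for _ in range(n_iter):
        counts = [0.0] * n_transcripts

        # E-step: split each read fractionally among its candidates.
        for hits in read_alignments:
            denom = sum(theta[t] for t in hits)
            if denom == 0.0:
                continue  # unaligned read, or all candidates at zero
            for t in hits:
                counts[t] += theta[t] / denom

        # M-step: renormalize expected counts into proportions.
        total = sum(counts)
        if total == 0.0:
            return theta  # no usable reads; keep the current estimate
        theta = [c / total for c in counts]

    return theta


# Three reads map uniquely to transcript 0; one read is ambiguous.
reads = [[0], [0], [0], [0, 1]]
theta = em_abundance(reads, n_transcripts=2)
print([round(x, 4) for x in theta])
```

In this toy input the uniquely mapping reads all support transcript 0, so the ambiguous read is assigned to it almost entirely after a few iterations. Discarding that read, or splitting it evenly, would give a different and less data-driven estimate.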

Springer - Publisher Connector

Public Library of Science (PLOS)

Fast Statistical Alignment

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/